Slides: Assessing Trends in World News

Analyzing Reddit Data Using a PySpark-AWS Framework

About the Team

Aaron A. Genin

Lucienne L. Julian

Peijin Li

Sonali S. Rathinam

Overview

  • Summary of Results
  • Introduction to the Data
  • Natural Language Processing
  • Machine Learning
  • AWS Infrastructure and Methodology

Key Takeaways

  • r/worldnews subreddit
  • Primarily Western Viewpoint
  • Gaps in News Coverage (ACLED)
  • Russia-Ukraine Conflict Dominates the Topic Space
  • Spatial Granularity in NER
  • Plurality of Negative Sentiment

Introduction to Data

  • Subreddit: Assumed Neutrality, Popularity
  • User Activity: 27,000 Distinct Posters, 1.2 Million Distinct Commenters
  • Live Threads: Daily Coverage of Conflict
  • Surge in Submissions/Comments at the Onset of War
  • Russia-Ukraine Conflict: 27% Posts, 45% Comments
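
A share such as the conflict-related percentages above could be computed by flagging each post or comment with a keyword match and dividing by the total. The sketch below is only an illustration in plain Python: the keyword list, function names, and sample titles are hypothetical, and the slides do not state how the team actually identified conflict-related content.

```python
import re

# Hypothetical keyword list; the terms actually used are not given in the slides.
CONFLICT_TERMS = ("russia", "ukraine", "putin", "zelensky", "kyiv")
_PATTERN = re.compile("|".join(CONFLICT_TERMS), re.IGNORECASE)


def is_conflict_related(text):
    """Return True if the text contains any keyword (case-insensitive substring match)."""
    return bool(_PATTERN.search(text or ""))


def conflict_share(texts):
    """Return the fraction of texts flagged as conflict-related (0.0 for empty input)."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(is_conflict_related(t) for t in texts)
    return hits / len(texts)


# Made-up sample titles, for illustration only.
sample_titles = [
    "Ukraine reports new strikes near Kyiv",
    "Earthquake hits coastal region",
    "Russia announces export restrictions",
    "Central bank raises interest rates",
]

share = conflict_share(sample_titles)
print(f"{share:.0%} of sample titles mention the conflict")
```

Substring matching is deliberately crude (it would also flag "Prussia", for example); a production pipeline would likely use word boundaries or a classifier.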

Gaps in Data Collection

Introduction to Data (Continued)

  • Western inclination challenges assumed neutrality
  • Divergent Pattern of User Behaviour in Social Media Sites
  • ACLED Aggregated Conflict Events demonstrate gaps in news coverage

Most Shared News Sites

Results…

Natural Language Processing Results

Topic Modeling:

  • Russia-Ukraine War Topics Dominate
  • Facets of Conflict

Named Entity Recognition (NER):

  • NER reinforces the prevalence of War Posts
  • Location Entities widely used

War Posts Frequent in the Topic Space

Location Based Entities Dominate the Posts

Machine Learning Results

Sentiment Analysis:

  • 4 models: 3 pretrained and 1 lexicon-based
  • VADER assumed to be most accurate
  • Vivek Model for Submissions:

45% 30% 25%

Predominantly Negative Sentiments Across Models
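
A sentiment breakdown like the one reported can be derived from per-text scores by bucketing each score into a label and counting. The sketch below assumes VADER-style compound scores in [-1, 1] and applies the conventional ±0.05 cutoffs; the function names and example scores are hypothetical, and the slides do not say which thresholds or label mapping the team used.

```python
from collections import Counter


def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a label using symmetric cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_distribution(compound_scores):
    """Return the fraction of scores falling into each sentiment label."""
    labels = [label_from_compound(s) for s in compound_scores]
    counts = Counter(labels)
    total = len(labels)
    return {
        name: (counts[name] / total if total else 0.0)
        for name in ("negative", "neutral", "positive")
    }


# Made-up scores for illustration; these are not results from the project.
example_scores = [-0.62, -0.31, 0.0, 0.02, 0.44, -0.07, 0.71, -0.5]
distribution = sentiment_distribution(example_scores)
print(distribution)
```

Running the same bucketing over each model's output would allow the label shares to be compared across models on a common scale.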

AWS Infrastructure and Methodologies Employed

AWS Pipeline

Methodologies Employed and Other Tools

  • johnsnowlabs (pretrained models)
  • Latent Dirichlet Allocation
  • VADER
  • GitHub
  • Quarto

Website Preview

Preview